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Abstract. Matrix Distributed Processing is a collection of classes and functions written in C++ for fast development 
of efficient parallel algorithms for the most general lattice/grid application. FermiQCD is an Object Oriented Lattice 
QCD application of MDP, under development at Fermilab. 



INTRODUCTION 

It is believed that, down to the smallest observed 
length scale, fundamental interactions in nature are local. 
This means that the equations one writes to describe the 
physical world are, in the majority of cases, local differ- 
ential equations or systems of local differential equations. 
They can be non-linear, strongly coupled and stochastic 
but, if they describe a fundamental interaction, they are 
also local. 

With very few exceptions, these equations do not have 
an exact analytical solution, therefore they must be solved 
numerically. This is done by discretizing the space on 
which the equations are defined and applying iteratively 
the appropriate algorithm. 

The most general local differential equation contains 
derivatives which, after discretization, becomes quasi- 
local terms. For example 

where a is the lattice spacing introduced in the discretiza- 
tion process. "Quasi-local" here means that a local term 
(the n-th derivative in x) becomes a linear combination of 
non-local terms localized within a radius na formx. 

The typical iterative algorithm that solves a local dif- 
ferential equation has the form 

ITERATE Vx: = (2) 

where H is some function of the position, x, and of the 
value of the field, (|)(.v), in some neighborhood of x within 
|x — jI < na. 

The exact form of H is not completely determined by 
initial differential equation, since there are different in- 
equivalent ways to discretize it. A difference in the dis- 
cretization procedure means a difference in the conver- 
gence speed and a difference in the discretization errors 
(that vanish with a 0). 



Finding the numerical solution can be very costly but 
these algorithms can be very efficiently parallelized (us- 
ing a supercomputer and/or a cluster of workstations). 
This is because one can partition the space x on which 
the field (|)(x) is defined over different CPUs. Each CPU 
applies the algorithm, eq. 2, to the local sites and this 
can be done in parallel. Because of the quasi-locality of 
the function H it is necessary that each process maintains 
an updated copy of the field variables for each y in 
the neighborhood of the local sites x. Each CPU will 
distinguish between local sites {x} (the sites stored by 
the CPU), boundary sites {y} (sites that are not local but 
a local copy exists because they must be accessed) and 
hidden sites (sites that do not affect the computation per- 
formed by that particular CPU). 

For every parallel algorithm to work it is necessary to 
keep the boundary sites updated, i.e. if a field variable at 
a particular site is modified by one of the CPU, its copies 
(maintained by different CPUs) have to be modified ac- 
cordingly. This requires communication among the dif- 
ferent CPUs. 

Matrix Distributed Processing (MDP) (1) provides the 
tools to implement this kind of algorithms on a computer 
in an easy and object oriented way. It also provides some 
basic classes for matrix manipulations, statistical analysis 
and a random number generator. 

Communications in MDP are based on Message Pass- 
ing Interface which is de facto a standard for parallel ap- 
plications. MPI calls are hidden inside the basic classes 
that constitute MDP and are invisible to the user. 



EXAMPLE 

As a first example of an application, let us consider 
here the following problem: 

Problem: Solve numerically, in U, the following equa- 
tion 

v2f/ = cos(f/ + y) (3) 



where U{x) and V{x) are fields of3x3 matrices defined 
on a four dimensional space x with the topology of a torus 
r^. y (x) is initialized with random SU (3) matrices. (In 
this example U (x) plays the role of the field (|)(x) of the 
last section.) 

Solution: The first step is to discretize the space on 
which the fields are defined by approximating it with a A^"* 
lattice (with N = S). The second step consists in writing 
down a discretized form of eq. 3, using eq. 1. In adimen- 
sional units (defined by imposing a = 1) one obtains 

U{x) = H{x,U) 

= i[cos(t/(x)+y(x)) + 
8 

u{x+6) + u{x-b) + 
f/(x+i) + f/(x-i) + 

f/(x + 2) + t/(x-2) + 

U{x + 3) + U{x-3)]+0{a) (4) 

wherex±nis>'= (xo±5o,n,xi ±5i,„,X2±52,n,X3±63,„). 

The third and usually non-trivial step is writing a com- 
puter program that implements, in a parallel way, the re- 
cursive relation of eq. 4. 

Here is how this can be implemented using MDP: 

01: #include "MDP_Lib2.h" 

02: #include "MDP_MPI.h" 

03: int main(int argc, char **argv) { 



04: mpi . open_wormholes (argc, argv) ; 

05: int box [ 4 ] = { 8 , 8 , 8 , 8 } ; 

06: generic_lattice space ( 4 , box) ; 

07: Matrix_field U (space, 3, 3) ; 

08: Matrix_f ield V (space, 3, 3) ; 

09: site x (space); 

10: forallsites (x) { 

11: U(x)=0; 

12: V (x) =space . random (x) . SU (3) ; 

13: }; 

14 : U. update () ; 

15: V.updateO; 

16: for{int i=0; i<100; i++) { 

17: forallsites (x) 

18: U(x)=0.125* (cos (U(x)-HV(x) )-H 

19: U (x+0) +U (x-0) + 

20: U (x+1) +U (x-1) + 

21: U (x-l-2) -l-U (x-2) -I- 

22: U (X-H3) -HU (x-3) ) ; 

2 3 : U. update () ; 

24: }; 

25: V. save ("V_field.dat") ; 

26: U. save ("U_field.dat") ; 

27: mpi . close_wormholes () ; 

2 8 : return 0; 



29: }; 



• lines 1,2 read the MDP libraries; 

• lines 4 and 27 open and close the communication 
channels among the parallel processes; 

• line 6 defines the object space belonging to the 

class generic_lattice with size specified by 
the box; (by default a generic lattice has the topol- 
ogy of a torus but the user can specify a different 
topology. The user can also specify on which pro- 
cessor each lattice site is stored. MDP optimizes the 
communications accordingly) 

• lines 7,8 define the two fields of matrices U (x) and 
V(x); 

• line 9 defines a variable x of class site defined on 
the space; 

• lines 10-13 initialize the fields in parallel; 

• lines 1 4, 1 5 take care of the communication to update 
the copies of the boundary sites; 

• lines 16-24 perform 100 iterations of the algorithm, 
eq. 4; each iteration is automatically paralleUzed 
over the available CPUs; 

• lines 25,26 save the input and output fields. 

Many lattice/grid problems can be solved in a similar 
way. MDP provides some of built-in field classes and the 
user can easily define its own field class which inherit the 
standardized update, load and save member func- 
tions. The standard load/ save functions guarantee the 
portabihty of data different platforms (both parallel and 
non-parallel). 

MDP also features a parallel random number genera- 
tor, i.e. one random generator for each lattice site, that 
insures reproducibility of computations independently on 
the way the lattice is partitioned. 

FERMIQCD @ FERMILAB 

Fermilab is using MDP to develop a general pur- 
pose Object Oriented Lattice QCD application (2), called 
FermiQCD^ The typical problem in QCD (Quantum 
Chromo Dynamics) is that of determining the correla- 
tion functions of the theory as function of the parameters. 
From the knowledge of these correlation functions one 
can extract hadron masses and matrix elements and com- 
pare them with experimental results. This provides both 



FermiQCD can be downloaded from: 
http ; / /thpcl 6 . f nal . gov/f ermiqcd. html 



a useful check of the theory (QCD in particular) and also 
a unique way to extract some of the fundamental parame- 
ters of the Standard Model (for example the CKM matrix 
elements). 

On the lattice, each correlation function is computed 
numerically as the average of the corresponding operator 
applied to elements of a Markov chain of gauge field con- 
figurations. Both the processes of building the Markov 
chain and of measuring operators involve quasi-local al- 
gorithms. 

Some of the main features of FermiQCD are the fol- 
lowing: 

• it supports an arbitrary number of lattices in each 
parallel program and an arbitrary number of fields 
defined on each lattice; 

• each lattice can have an arbitrary dimension, arbi- 
trary topology and arbitrary partitioning; 

• some of the basic built-in fields are: 

gauge_f ield, 
f ermi_f ield, 
staggered_f ield, 
scalar_f ield; 

• gauge_f ields are in the adjoint representation of 
SU (Nc) for an arbitrary Nc. 

The basic parallel algorithms implemented in Fer- 
miQCD are (3): 

• heathbath Monte Carlo to create the Markov chain 
of gauge field configurations; 

• 0{a^) improved heathbath Monte Carlo; 

• minimum residue inversion and stabilized biconju- 
gate gradient inversion for the fermionic matrix; 

• ordinary and stochastic fermionic propagators; 

• ordinary fermionic actions: Wilson, Clover (0{a) 

improved) and D234 (0{a^) improved); 

• staggered fermionic actions: Kogut-Susskind, Lep- 
age (O(a^) improved). 

Moreover FermiQCD is able to read existing Lattice 
QCD data in the CANOPY/ACPMAPS format, in the 
UKQCD format and in the MILC format. 

Here are few examples of FermiQCD Object Oriented 
capabilities (compared with examples in the standard 
textbook notation for Lattice QCD) 
1) QCD: (algebra of Euclidean gamma matrices) 

A = y^ySg^'T' (5) 



FermiQCD: 

Matrix A; 

A=Gamina [mu] *Gamina5*exp (3*I*Gamina [2 ] ) ; 

2) QCD: (multipUcation of a fermionic field for a spin 
structure) 

Vx : x{x) = (y^ -|-m)\|/(x-h/i) (6) 

FermiQCD: 

/* assuming the following definitions 
generic_lattice space_time (...); 
fermi_field chi (space_tiine, Nc) ; 
fermi_field psi (space_tiine,Nc) ; 
site X (space_time) ; 
*/ 

forallsites (x) 

chi (x) = (Gamma [ 3 ] +m) *psi (x+mu) ; 

3) QCD: (translation of a fermionic field) 

\/x,a : Xa{x) = U (x,/i)\|/a(x (7) 

FermiQCD: 

forallsites (x) 

for(a=0; a<psi . Nspin; a++) 

chi (x, a) =U (x, mu) *psi (x+mu, a) ; 
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